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@ Pipelined apparatus and method. 

@ A pipeline data processor is simultaneously 
operable in a pipeline mode, a parallel mode 
and a vector mode which is a special case of the 
pipeline mode. Each pipeline stage has its own 
stage program counter 71, 72, 73, 74, 75 and 76. 
A global program counter 34 is incremented in 
the pipeline mode. The instruction addresses 
generated in the global program counter 34 are 
distributed to those pipeline stages which first 
become available to perfonn pipelined data 
processing. Any given pipeline stage may 
dynamically switch between pipeline mode and 
a parallel mode in which the stage program 
counter a>unts and supplies instruction addres- 
ses independently of any other pipeline stage. A 
vector mode uses pipeline Instructions which 
are repeated to enable any number of the 
pipeline stages to participate In vector calcu- 
lations. In the vector mode, one pipeline in- 
struction address is held in the global program 
counter to be repeatedly supplied to respective 
first avaOable pipeline stages untO the vector 
calculations are completed. 
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The present invention relates to data processing systems, and particularly to apparatus for processing 
instruction streams in a pipelined processor, and methods of operating pipelined processing apparatus. 

Pipelined processors, or "pipelining", have been used to obtain high performance at lower costs. In a 
pipelined system, the execution of a machine Instruction Is divided Into a number of execution stages such that 

5 several machine instructions can be effectively simultaneously executed. The cycle time of a pipelined pro- 
cessor Is determined by the longest executing stage rather than by total instruction execution time. Examples 
of pipelined systems are found in US-A-4,794,524, US-A-4,589,067, and US-A-4,620,275 among others. 

To keep a pipelined system efflcient, instruction executions must be started before other instruction execu- 
tions are completed; otherwise various processing stages in the pipeline are idle. This latter situation Is known 

10 as "pipeline breakage." Such breakage is often caused by data dependencies for instruction execution, i.e. a 
given instruction may need the results of a prior Instruction; such prior instruction has not been executed to 
provide such data until after the given instruction Is ready to enter the pipeline of Instruction execution. In this 
situation, the pipeline stages becomes idle until the required data is made available. 

Pipeline breakage can be reduced by using a vector instruction set wherein each vector instruction causes 

IS a particular operation to be repeated for each element of the vector. As the respective instruction executions 
which cause operations on the vector elements are perfonmed on different elements, the problem of data depen- 
dencies Is removed. The price paid for selecting a vector instruction set is that the same operation has to be 
perfonmed on all elements of the vector. There are instances in numeric processing when the operation to be 
performed is not always the same for all elements of a vector. That is, a scalar operation must now be perfonmed 

20 rather than a sector operation. 

Based on the above facts, some existing machines employ vector instruction sets to avoid the breakage 
problem because each vector operation can specify many calculations with no result dependencies. Although 
most numeric applications inherently have significant parallelism, it is not always of a vector type where the 
same computation Is performed on many elements of data. Circuit simulation Is one example of this requlre- 

25 ment The programmed device models have data dependent branches so that not only do the different device 
types require difTerent calculations but two different devices of the same type may require different calculations 
as a function of their current modelled operating state. 

Another known approach to avoid pipeline breakage Is to use the pipeline for executing simultaneously a 
plurality of programs which are interieaved for execution in the pipeline US-A-4,589,067, US-A-3,728,692 and 

30 US-A-4,229,790 all show such an arrangement. In this arrangement, if there are eight stages for executing 
Instructbns, there can be up to eight programs simultaneously executed with the data dependencies being 
limited to the respective instruction streams of the eight programs. In the example, every eighth instruction 
entering the pipeline is from one of the eight programs being executed. This anrangen^nt, In effect, makes the 
pipeline look like eight independent processors. In such solutions to the pipeline breakage problem, electronic 

35 circuits are usually added to such processors for tracking the multiple Instruction streams for interieaving same 
into the pipeline. This latter arrangement Is refenned to herein as "parallel operations or parallel mode". 

Many numeric applications use a diversity of computational algorithms. Execution of the various algorithms, 
within one program or set of programs, usually can be made most efficient by using different program execution 
strategies, such as set forth above in the cited references and as briefly described herein. What is needed is 

40 a means and method for enabling usage of the pure pipeline operations, vector operations, or parallel oper- 
atk)ns to execute efficiently any numeric application which requires more than one of the three above-described 
approaches. By the use of the present invention, which provides for satisfying the needs of the three 
approaches, significant perfonmance improvements are available. 

This invention seeks to enable dynamic switching between and executing parallel pipeline, vector and par- 

45 allel operations in pipelined data processing apparatus. 

Further, this invention seeks to reduced pipeline breakage In a pipelined data processing apparatus. 
In accordance with certain aspects of the present invention, apparatus is provided in a pipeline which 
enables each of the stages In the pipeline to operate In diverse modes of operation for achieving a dynamically 
changeable mode of operation between pipeline mode, vector nnode and parallel nrKxJe. The switching prefer- 

50 ably occurs at or about instruction fetch times for the respective pipeline steges or processors. All three types 
of operations can be occurring simultaneously in a pipeline. 

Each stege of the pipeline has 'its own independent program counter. The pipeline iteelf has a global pro^ 
gram counting means. The global program counting means provides instruction addresses to the stages cur- 
rently operating in a pipeline or vector modes while the stages operating in or parallel mode employ their 

55 respective program counting means. Communication means are provided for enabling signalling program 
status between the stage and the global program counting means. One of the communication means is primarily 
used for enabling vector operations. It is preferred that the date used in any of the operattons be stored in a 
shared data storage unit; Instructions are preferably stored in a shared instruction storage unit. Switching stages 
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between the three modes of operations is effected by programmed means; specrTically, in a particular form of 
the invention, special branch or jump Instructions effect switching the modes of operation of the stages. Coding 
the special branch instructions into programs to be respectively executed in the pipeline, vector or parallel mode 
enables switching the stages between the three modes. The vector mode is indicated by Information stored In 

5 a so-called index register which is accessible by any of the stages. In a preferred form of practising the invention, 
the machine is initialized and begins operation in the pipeline mode from which the other modes may be selected 
for any of the pipeline stages. 

The scope of the Invention is defined by the appended claims; and how it can be carried into effect is des- 
cribed hereinafter with reference to the accompanying drawings in which: 

10 Fig. 1 is a simplified block diagram of a pipeline numerical processor which illustrates a preferred embo- 
diment of the invention; 

Fig. 2 shows operation of a typical pipeline executing in a parallel mode; 

Fig. 3 is a simplified data flow diagram illustrating data flow for the Fig. 1 Olustrated processor; 

Fig. 4 illustrates program control means used in the Fig. 3 illustration; 

IS Fig. 5 is a simplified execution illustration chart on control and mixing of pipeline, vector and parallel oper- 
ations or mode In the Fig. 1 illustrated apparatus while using the Fig. 4 Illustrated program control means; 
Fig. 6 illustrates the cyde timing for the Fig. 5 operation illustration; and 
Fig. 7 is a simplified illustration of an Instruction format usable in the Fig. 1 illustrated apparatus. 
A general purpose processor, such as a IBM PS2 or RISC computer, is a host 10 (Fig.1) for pipeline 

20 apparatus 11. Data storage unit 12 is shared between host 10 and apparatus 1 1 for providing communication 
between the two data processors. Typiplcal message procedures are used for these communications. Host 10 
provides the supervisory operations while apparatus 1 1 does the specialized numeric processing amenable to 
pipeline functions. Apparatus 11 includes a pipeline program system 14 which includes program counting 
means and other controls, (Fig. 4). The usual processor or stage counter 15 cycles the pipeline and indicates 

25 the processor or stage selection to Instruction storage system 1 6 (Fig. 1 ) The actual numeric processing is per- 
formed by special processors 17, which in one embodiment include a floating point unit (FPU) 30 and table 
lookup units (TLU) 41 .(Fig. 3). The operations of processors 1 7 are time sliced between the pipeline stage pro- 
gram counters 71 to 76 (Flg.4) creating pipeline stages, i.e. virtual processors, each of which has predetenmined 
time slices of processors 17 and one of the stage program counters 71 to 76. 

30 In accordance with the invention, the pipeline program system 14 includes program counting means for 
each of the pipeline stages, and a global program system 20 cooperates with the pipeline the pipeline program 
system 14 for enabling program-effected or dynamic switching between and concurrent pipeline, vector and 
parallel operations through pipeline program system 14. Global index register 21 enables control of vector oper- 
ations. The instruction storage system 1 6 responds to requests for instructions In a usual manner for pipelined 

35 apparatus. System 1 6 also communicates with the global index register 21 . The global program system 20 has 
a plenary pipeline and vector control over pipeline program system (PPS) 14 and is Instrumental in switching 
from pipeline operations to either parallel or vector operations. The program execution In either the parallel or 
vector modes effect mode switching for the pipeline stage executing in such modes back to the pipeline mode. 
Fig. 2 illustrates a parallel mode of operation for one operatk>n cycle of a pipeline. This figure is a time "snap- 

40 shot" of execution of sue instructions from six different programs or Instruction streams. Shown are sb( stages 
of pipeline, the figure shows seven steps so that one instruction stream 1 is shown with two instructtons. Pipeline 
Interleaving of the pipelined instruction stream is round-robin, such as enabled by a processor counter 15. The 
tenn "stream" indicates which of the six programs or Instruction streams are being used, "instruction" indicates 
the relative Instruction number in the respective Instruction streams, and the labels in the bones 1 to 7 respect- 

45 iveiy indicate what machine operation is being performed, box 2 indicating the first stage, and box 7 the last 
operation in the pipeline for executing an instruction. Box 1 indicates the instruction fetch prior to the Instruction 
execution being started In the pipeline. This operation occurs concurrentiy with the operation represented by 
box 7. The labels in the boxes 2 to 7 respectively indicate the machine operations performed by the stages in 
executing an instruction. Box 2 represents stage 1, the pipeline entry stage, which effects a data fetch for 

50 Instruction 1 of stream or program 6. Concunrentiy therewith, stage 2, represented by box 3, executes the sec- 
ond pipeline step for instruction 1 of Instruction stream 5, and so on to stream 1 instruction 1 at box 7. The 
pipeline appears as sbc "virtual processors". The time slicing of processors 17 Is apparent from Inspection of 
boxes 4 to 6. The arrangement is such tiiat at box 7, the data is stored at box 7 when the instruction fetch for 
the second instruction of the Instruction stream 1 next executing instruction 2. This selection prevents pipeline 

55 breakage, assuming that all sbc Instmction streams are truly independent This arrangement also requires mul- 
tiple program counting means, one for each of the Instruction streams, a processor counter 1 5 and a multiplexor 
for the multiple program counting means to fetch Instructions. In the pipeline mode, all six concurrentiy execut- 
ing instructions are in the same instruction stream. In accordance with the present invention, the parallel, vector 
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and pipeline modes may occur simultaneously in the pipeline. For example, boxes 2, 4, and 6 (pipeline stages 
1, 3 and 5 respectively) execute an Instmctton In the pipeline mode while the remaining pipeline stages are 
executing Instructions In the parallel mode. 

All the Instructions for all nrK)des of operation have the same format, shown In Fig. 7. Most of the machine 
6 operations done by the Instructions are the same. The following description with reference to Fig. 3 details the 
machine operations used to execute an instruction. Whenever the operations are different for the different types 
of instructions, the details of such differences are described later in connection with each such instruction. 

Processor Selection 

10 

A first step in executing an Instruction Is to select the processor that will be executing the instruction. Pro- 
cessor counter 15 (Flg.4) Indicates by Its count supplied overcount bus 25 to pipeline system 14 which of the 
processors or stages in the pipeline system is to be used for the next instruction to be executed. The counting 
is continuous to effect a round-robin selection of the stage processors (program counters 71 to 76, as described 
IS above) within every pipeline cycle. 

Instruction Fetch 

Once a stage processor (pipeline stage) is selected, its stage program counter (Fig. 4) addresses the 

20 instruction store 26 (Flg.3) through multiplexor 28 (Fig. 4) and address bus 27 to fetch the next instruction in 
its instruction stream. The instruction address also travels from bus 27 to address Incrementer 81 for incre- 
mentation fbrthe next following instruction address during a cunrent phase of the pipeline cycle. Five stage delay 
circuit 82 receives the incremented addresses for transmittal through multiplexor 83 to the stage program 
counter indicated by processor counter 15. 

25 The mode of operation for the respecth^e pipeline stages Is determined by the instruction currently being 
executed by that pipeline stage. The current mode of instruction execution Is also stored in the pipeline delay 
latches PDL, a shift register 84 having shift register stages PDL-1 to PDL-6. A clock signal supplied over line 
85 from control 52 (Fig.3) Indicates the end of each pipeline execution phase (six Indications per pipeline cyde) 
which increments processor counter 15 (Fig.4) and shifts the contents of shift register 84 one position. At the 

30 beginning of each pipeline execution phase (box 1, Fig. 2), the OPCODE on line 99 sets PDL-1 stage of shift 
register 84 to indicate the mode of the cunrent instruction being fetched, i.e. pipeline or parallel. Vector oper- 
ations are effected through the pipeline Instructions pipeline calculate, pipeline branch or pipeline set global 
flag while index register 21 (Fig.3) has a count greater than one. The next dock indication on line 85 (Fig.4) 
shifts the contents of PDL-1 to PDL-2 in synchronism with processor counter 1 5 and the operation of the pipeline 

35 stages induding the time-sliced portions of processors 1 7. In this manner the mode, pipeline or parallel, follows 
the instruction execution in the pipeline, the shift register bits PDL-1 to PDL-6 Indicating the mode to all com- 
ponents In the pipeline apparatus used in instruction execution as collectively indicated by arrows 86. Multip- 
lexor 83 also responds to the shift register 84 PDL mode indications to select either the Incremented instruction 
address supplied through five stage delay 82 in the parallel mode or the instruction address cunrently stored 

40 In global program counter 34 (Fig.4) in the pipeline mode. Multiplexor 100 responds to the lines 86 mode indi- 
cations to selectively pass the incremented global program counter 34 contents received from incrementer 87, 
all as described with respect to the definitions of the instructions used in the pipeline and parallel modes and 
the vector indication of global index register 21 Inhibiting multiplexor 100 from passing the increment global 
Instructton address as Indicated by a signal on line 88. When the contents of global index register 21 Is greater 

45 tiian unity, then irrespective of the mode of the current instruction, multiplexor 1 00 is inhibited from passing the 
global program counter 34 incremented address from Incrementer 87. When the contents of global index regi- 
ster 21 Is equal to one, tiien the instruction mode indicated by shift register 84 conb-ols multiplexor 1 00 to effect 
incrementatk3n of or hold the current instruction address in global program counter 34. Incrementation of the 
global program counter 34 occurs during box 1 (Fig. 2), the instruction fetch, i.e. is effected through the onset 

50 on the execution of the cunrentiy executing instruction right after the OPCODE Is decoded, multiplexor 1 00 being 
enabled or not by the OPCODE of the cunrent instruction. 

Address Calculation 

55 Following the instruction fetch, addresses contained in the instiruction are computed. Each instruction con- 

tains three addresses, BD1 , BD2 and BD3 (Fig. 7). These addresses are used respectively as source 1 , source 
2 and destination addresses. The source 1 and source 2 addresses are used to fetch two words from data store 
12 that will be operated on by the FPU 30. For non-branch instructions, the destination address specifies the 



4 



EP 0 490 524 A2 



location in data store 12 that the FPU 30 calculated result Is to be stored. For branch Instructions, the destination 
address is the branch target address. 

The address fields are known as base-displacement (DB). The field identifies the base register (not shown) 
containing a base address, a displacement from the base address, whether the addressing Is indirect or in an 

5 index mode. This an^angement provides four addressing modes, a true base plus displacement, base plus 
index, base plus displacement indirect, and base plus indexed indirect Each of the three addresses in an 
Instruction has its own addressing mode. If more than one of the addresses specifies an indirection address, 
then the indirect address Is selected using the priority of source 1 first, source 2 second, and then destination 
address. The selected Indirect address is then applied to all addresses specifying Indirection in that one Instruc- 

10 tion. Pointers in data store 12 may also be used by host 10 which does byte addressing. 

Global Program Counter Update 

Next, global program system 20, which includes global program counter 34 (Fig 4) is updated as necessary. 
15 Such updating occurs only in the pipeline mode, in either the vector or parallel modes global program counter 
34 Is not updated. Such operation mode is reflected in the INSTR OPCODE 35 (Fig. 7). If the pipeline has six 
pipeline stages 71 to 76, operating in the pipeline mode, then the instruction address for the successive instruc- 
tions for the respective pipeline stages are displaced by six. 

20 Data Fetch 

This operation is represented by box 2 (Fig. 2). Any of the above-described addressing modes can be used. 
The MISC field 37 (Fig. 7) contains immediate bits for the BD1 field in the instruction word. When the source 
1 1mmediate bit Is set to an active condition, then the source 1 data, rather than being accessed from data store 
25 12, is the source 1 address BD1. If the Immediate bit Is not set, then the source 1 data Is fetched from data 
store 12 at the source 1 1ndicated address. The source 2 data Is fetched from the data store 12. 

Address Register 40 Load 

30 Address registers 40 (Fig. 3) supply data addresses to data store 12 and branch addresses to the desig- 
nated program counter 34 or 71 to 76 all through switching logic 51. The address registers 40 are addressable 
using known addressing techniques. The address of the address registers 40 specified in the MISC field 37 
will be loaded from the data specified In tiie load source portion of field 37. The following sources may be speci- 
fied as the address register load source: no address register will be loaded, the address register will be loaded 

35 with a pointer from the data word fetohed at the address contained in BD2, the address register will be loaded 
with an integer from the data word fetched at the address contained in BD2, the address register will be loaded 
with the contents of BD2 (source 2 address), the address register will be loaded with the contents of the stage 
program counter 71 to 76 (Fig. 4) associated with the instruction plus unity (loads the address for the next 
instruction word to he fetched for the pipeline stage), or the address register will be loaded with the contents 

40 of the global Index register 21 . 

Table Lookup 

Processors 1 7 include table lookup unit (TLU) 41 which can be used to accelerate execution of table based 
45 algorithms, such as divide, square root, logarithm, exponential and the like TLU 41 may also include general 
tables, such as those used in program models of an entity or device to be simulated or analyzed. The table 
lookup operates on the source 1 data, and to provide a table looked-up value to FPU 30 over bus 42 in place 
of the BD1 specified source 1 data. The tables can be In a ROM fonm, down loaded from host 10 into a RAM, 
loaded from a diskette Into a RAM, and the like. To support a variety of table solvable algoritiims the TLU CON- 
50 TROL field 44 (Fig. 7) not only indicates which table of several to be used but also operational parameters 
beyond the present description. This operation is represented by box 3 (Fig. 2) FPU LOAD AND TABLE 
LOOKUP. As one example, a table can be used for extracting the exponent from a floating point number and 
a second table can be concurrently used for extracting the mantissa of such floating point number, both values 
to be operated upon by FPU 30 in execution of the current instruction. 

55 

FPU 30 Calculation 

The source 1 (or its TLU derivative) and source 2 data words are loaded into FPU 30, one to a multiplier 
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portion (not shown) and one to an arithmetic-logic unit (ALU) (not shown). Provisions are made for swapping 
the FPU internal destinations of the two data words, such as may be provided in FPU MOD field 46 (Fig. 7) 
associated with the FPU OPCODE field 47. Both data words are computed upon as a function of the FPU 
OPCODE field 47 contents. FPU 30 operations are represented by boxes 4, 5 and 6 (Fig. 2). 

5 

Program Counter Update 

The stage program counter is updated to the address of the next Instruction to be executed by this pipeline 
stage. The source of this update is a function of the INSTR OPCODE field 35. 

10 

FPU 30 Result Storing 

If the iNSTR OPCODE is one of the later-described calculate types, then a result has been calculated by 
FPU 30 that may be stored in data store 12 at the destination address. Before the FPU result is stored, if the 

15 flip destination bit (not shown) in the FPU MOD field 46 is set, then the high order half of the calculated result 
and the low order half of the calculated result are swapped before storage. The storage of the FPU result is 
done under the control of the FPU MOD field 46 as follows: The result Is not stored, only the high order half of 
the FPU result is stored, only the low order half of the FPU result is stored, or the entire FPU result is stored. 
This storage activity is represented by box 7 (Fig. 2) data store. 

20 Access to address reglstera 40 (Fig. 3) is under control of multiplexor MUX 50 in a known manner for achiev- 
ing the results set forth above. MUX 50 passes the address for selecting which of the address registers is to 
be accessed. The instruction data from instruction store 26 is loaded into the address register 40 indicated by 
the register address supplied through MUX 50. Logic 51 includes adding units respectively for the source 1, 
source 2 aid destination addresses, such as for adding base and displacement values, etc, as received from 

25 the address registers 40 and instruction store 26. Controls 52 include the usual Instruction decoder, instruction 
execution sequencing circuits, and circuits known In the art which supply control signals for effecting machine 
operations which execute decoded instructions. The intemal construction of such controls are well known, 
Implementation of Instruction execution sequences are discernible from descriptions of such instructions; 
accordingly a detailed showing of tiiese controls is dispensed with to avoid further obfuscation of the description 

30 of the invention. The an-ows 52A and 52B indicate instruction word connections 29 to instruction store 26. Out- 
put connections also go to logic 51 and other units of pipeline apparatus, as is known. The modified addresses 
supplied fipom logic 51 pass through multiplexor 53 to data store 12, to MUX 50 for selecting another address 
register or to pipeline program system 14 (destination address) for a branch operation, all in accordance with 
instruction decoding in control 52 (control connections not shown). Address information from MUX 53 also 

35 travels over bus 54 to select a table in TLU 41 using MUX 55 to switch the table selection input as indicated 
by the OPCODE MOD field 58 (Fig. 7) to effect the functions described above for the TLU 41. Such signals 
also travel to FPU 30. Additionally, FPU 30 receives data words to be operands over bus 60 from data store 
12. Bus 61 carries data from data store 12 to FPU 30 and to MUX 50 for loading address register 40. Global 
index register 21 receives vector control data from logic 51 and supplies its stored data to load one of the 

40 address registere 40 via MUX 50. Global program system 20 receives branch and other control information over 
bus 65 from instruction store 26 or address registere 40 via logic 51. Global program system 20 manages the 
pipeline program system 14 by supplying address and control information over bus 67. Status register 90 is 
addressable by host 1 0 as Indicated by lines 91 respectively for writing into status register (including resetting 
after reading predetermined status bits) and for reading the contents thereof. The contents of status register 

45 90 includes error information relating to pipeline apparatus 1 1 , such as instruction decoding encore set into the 
status register by control 52, other en^or information from other components shown in tills application, branch 
status, host signal pending bit (signifying to host 10 that infonmation or calculation results are available in data 
store 12,), host Interrupt pending bit which indicates to control 52 that pipeline apparatus 1 1 has signalled host 
10 to read from tiie communication area (not separately shown herein) of data store 12 for obtaining operations 

50 to be performed for host 10, status of FPU 30 operations including calculation error indications, global flag bit 
(later described) and tiie like. The contents of status register 90 enable communications between host 10 and 
pipeline apparatus 1 1 as well as containing program status information useful in the different modes of operation 
of pipeline apparatus 1 1, as will become apparent 

Before describing more electronic apparatus used in the illustrated embodiment, tiie instruction types and 

55 tiieir respective operations are described. The instruction operation code in INSTR OPCODE field 35 (Fig. 7), 
as modified by the modifier in OPCODE MOD field 58, controls how the later described program oountere are 
updated, whether a FPU result will be stored, and the loading of certain global facilities found in global program 
system 20. By controlling the program counter updating, the opcode detennlnes whether the pipeline stage 
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executing the instruction is operating in parallel mode or in pipeline mode. The vector mode is detennined by 
global index register 21 having a value greater than 1 and the respective pipeline stage obtaining its next instruc- 
tion address from global program counter 34, as wilt become more apparent. A parallel mode opcode typically 
increments the respective stage program counter 71 to 76 by one causing the pipeline stage to execute the 

5 next instruction in its parallel subprogram without affecting what the other pipeline stages are doing. A pipeline 
mode opcode typically enables incrementing global program counter 34 in the first pipeline operation (box 2, 
Fig. 2) and then loads the processor counter 15 to one of the stage program counters 71 to 76 from global pro- 
gram counter 34 In the last pipeline operation corresponding to box 7 "data store" (Fig. 2). If all six pipeline 
stages are executing pipeline mode instructions, then global program counter 34 is incremented by each of the 

10 six pipeline stages before global program counter 34 supplies the instruction address to designated one of the 
stage program counters 71 to 76. So if the first pipeline stage executes an instruction in the pipeline mode at 
address XY, then its stage program counter will be loaded with XY 6. In this fashion the first pipeline stage 
executes the instruction at address XY, the second pipeline stage executes an instruction stored at address 
XY+1, the third pipeline stage executes an instruction stored at address XY+2 and so on. This global program 

IS counter controlled sequencing continues until later described branch-to-parallel mode instructions are encoun- 
tered in the stream of instructions currently being executed in the pipeline mode. 

The below-described instruction opcodes illustrate the practice of the present invention in the illustrated 
embodiment. In the description below, the tenm program counter refers to the stage program counter 71 to 76 
of the pipeline stage executing the instruction. The mode of instruction is indicated by the first word, i.e. parallel 

20 or pipeline. Parallel instructions are described first; for the most part, parallel instructions have pipeline coun- 
terpart instructions. 

Parallel Calculate 

25 This instruction effects calculations witiiin a parallel subprogram. The following operations are perfonned: 
address calculation (box 1, Fig. 2). data fetch (box 2. Fig. 2), address register load, table lookup (box 3, Fig. 
2), FPU calculation and FPU result store are done as described (boxes 4,5 and 6, Fig. 2). The program counter 
is incremented by one. The global program counter 34 i not updated in pipeline mode. 

30 Parallel Calculate Serialized 

This instruction Is the same as the Parallel Calculate instruction except that it is serialized. Serialization 
means that any resource, such as a storage location in data store 12, is dedicated to the instruction execution 
and is not available to any requestor until after the completion of this instmction execution. This action prevents 

35 any calculate operation from starting until any other pipeline stages with outstanding serial operations using 
the same source 1 address have completed. Thus a calculation reflects any updates to the source 1 data that 
were in progress when the instruction was encountered. This instruction is used to allow several parallel sub- 
programs to update the same location when the order of updates is not important. Once an instruction has been 
serialized, all other pipeline stages are prevented from starting a serial operation using that source 1 address 

40 until the entire sequence has been executed. 

Parallel Calculate and Retum 

This Instruction is the same as Parallel Calculate except that rather than incrementing the stage program 
45 counter at the end of the instruction execution, this action causes the pipeline stage to return firom a parallel 
subprogram to pipeline mode operations. 

Parallel Calculate and Retum Serialized 

50 This instruction is the same as the Parallel Calculate and Return instruction except that it is serialized in 
the same manner as in Parallel Calculate Serialized. As set forth above, all the parallel mode Instructions having 
the retum feature enable the executing stage processor to retum from executing a parallel mode subprogram 
to pipeline mode operations. 

55 Parallel Branch 

This Instruction is used to perform a conditional branch witiiin a parallel subprogram. The branch is con- 
ditional on the results of tiie FPU calculation perf'ormed by THIS instruction, it can be thought of as a compare 
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and branch instruction. The FPU OPCODE in field 46 (Fig. 7) should be a floating point compare for floating 
point data and a integer subtract for integer data. The following operations are performed: address calculation, 
data fetch, address register load, table lookup, and FPU calculation. The FPU result Is not stored but the FPU 
condition code is used with a condition set forth in the FPU MOD field 46 to determine if the branch is to be 
5 talcen. One of the following conditions are used to effect a branch: brach if less than, branch if less than or equal, 
branch if equal, branch if greater than or equal, branch if greater than, branch if not equal, unconditional branch, 
or never branch (no-op). If the branch is taken, the respective stage program counter 71 to 76 (Fig. 4) is loaded 
with the destination address; not taking the branch increments the respective stage program counter by one. 
The global program counter 34 is not updated in this parallel mode Instruction execution. 

10 

Parallel Branch Serialized 

This instruction is the same as the Parallel Branch instruction except that the serialization is in effect. This 
serialization control prevents the branch operation from starting until any other processors with outstanding 
IS serial operations to the same source 1 address have completed. Thus a compare done within this branch 
instruction will reflect any updates to the source 1 data that were in progress when the branch was encountered. 

Parallel Set Global Flag 

20 This Instruction is used to set the a global flag bit in status register 90 (Figs. 3 and 4) from within a parallel 
subprogriim. This instruction effects transfer of program contol infomiation from a parallel subprogram to a 
pipeline program. The following operations are perfonmed: address calculation, data fetch, address register 
load, table lookup, FPU calculation, and take branch calculation are the same as for parallel branch above. If 
the take branch calculation indicated a branch then the global flag register is set Otherwise the global flag bit 

25 of status register 90 is not changed. The respective stage program counter 71 to 76 is incremented by one. 
The global program counter 34 is not updated (parallel mode instruction). The global flag bit of status register 
90 can be sensed later by a program executing in the pipeline mode. 
The instructions described below are pipeline mode Instructions. 

30 Pipeline Calculate 

This instruction is used to perfoni) a calculation within a pipeline mode program. The following operations 
are performed: address calculation, data fetch, address register load, table lookup, FPU calculation and FPU 
result store are done as described above. The global program counter 34 is incremented by one. The respective 
35 stage program counter 71 to 76 is loaded with the current instruction address in the global program counter 
34. 

Pipeline Calculate and Load Global Index Register 

40 This instruction is used to perform a calculation within a pipeline mode program and to load the global index 
register 21 in preparation for a vector operation. A vector operation is performed by repeating the instruction 
at the next instruction address which is stored in global program counter 34. Any number of pipeline stages 
can execute such instruction during its respective time slice, each pipeline stage repeatedly obtains instruction 
addresses from the global program counter 34. Pipeline stages operating In the parallel mode stDl obtain their 

45 respective instructions through incrementing the respective stage program counters 71 to 76; upon completion 
of the parallel program, a given pipeline stage obtains its next instruction from global program counter 34 to 
participate in the vector calculation until it is completed as indicated by the index register 21 being down counted 
by the execution of each instruction in the vector nrKxie until unity. The following operations are perfonmed by 
this instruction execution in preparation for the upcoming sector calculation: address calculation is done as des- 

so cribed above. The global index register 21 is loaded with the value of the source 1 address before indirection, 
if any is specified, is perfomned. The global Index register is loaded immediately after address calculation (as 
In box 2, Fig. 2) and its value may be loaded Into an address register 40 In the same Instruction. Data fetch, 
address register load, table lookup, FPU calculation and FPU result store are done as described above. The 
global program counter 34 is incremented by one. The respective stage program counter 71 to 76 is loaded 

55 with the current contents of global program counter 34. 

Whenever a value greater than zero is loaded into the global Index register, incrementing of the global pro- 
gram counter 34 is inhibited starting with the next instruction. The global index register content is decremented 
by one each time a pipeline mode instruction is executed. This control action continues until the global Index 
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register 21 contents have been decremented to zero. This action causes the next instruction to be executed a 
given number of times as indicated in the giobai index register 21 for effecting pipeline mode established pro- 
gram control for use in a vector mode calculation involving a variable number up to six of the pipeline stages. 
Only the following Instructions are valid for the next Instruction to be repeatedly perfonmed to effect a vector 
5 calculation: Pipeline Calculate, Pipeline Branch or Pipeline Set Global Flag. The contents of the global index 
register 21 can be loaded Into an address register 40 so that it can be used to index vector data. 

Pipeline Branch 

10 This instruction is used to branch the pipeline stage executing it from pipeline mode to a parallel mode sub- 
program to be next executed by such executing pipeline stage. Note that the pipeline branch can be executed 
In the vector mode for causing a predetermined number of the pipeline stages to execute a series of subprog- 
rams, either for vector operations or for executing a series of parallel subprograms. In executing this instruction, 
the following operations are pierformed: address calculation, data fetch, address register toad, table lookup, 

15 FPU calculation, and take branch calculation proceed the same as for Parallel Branch above. The global pro- 
gram counter 34 Is incremented by one. The respective stage program counter 71 to 76 updating proceeds the 
same as parallel branch described above. 

Pipeline Branch Global 

20 

This instruction is used to perform a conditional branch within a pipeline mode program. The branch is done 
by altering the address contents of the global program counter 34 so that the next pipeline stage indicated by 
the processor counter 15 executes the branch target instruction. Since the global program counter 34 must be 
updated in the first pipeline stage, the global flag bit In status register 90 is used to detenmine the branch decision 

25 rather than the later available FPU calculated result, which is usable as a branch condition code In other branch 
operations. The following operations are perfonmed in the instruction execution: address calculation, data fetch, 
address register load, table lookup, and FPU calculation are done as described above. The FPU result is not 
stored. The global program counter 34 Is updated according to the Instruction conditions as set forth in the 
opcode mod field 58 field as follows: (1) "if set"; If the Global Flag Register is set then the global program counter 

30 34 is loaded with the destination address before address indirection. Otherwise the global program counter is 
incremented by one. (2) "if not set"; If the global flag bit in status register 90 is not set, then the global program 
counter 34 is loaded with the destination address before address indirection. Otherwise the global program 
counter is incremented by one. (3) "unconditional action"; The global program counter 34 is loaded with the 
destination address before address indirection. (4) "end action"; The global program counter 34 is incremented 

35 by one. The global flag bit in status register 90 is reset. The respective stage program counter 71 to 76 is loaded 
with the contents of the giobai program counter 34. As the FLU result cannot be stored, the branch is conditioned 
by the global flag bit in status register 90 rather than using the FPU 30 calculated condition code. In executing 
this instruction, no useful function is performed by FPU 30. 

40 Pipeline Set Global Flag 

This instruction Is used to set the global flag bit in status register 90 from within a pipeline mode program. 
The following operations are performed: adress calculation, data fetch, address register load, table lookup, and 
FPU calculatbn and flag setting proceed the same as for parallel set global flag instruction described above. 
45 The global program counter 34 is incremented by one. The respective stage program counter 71 to 76 indicated 
by processor counter 15 is loaded with the instruction address contents of the global program counter 34. 

Pipeline Wait 

50 This instruction causes the giobai program system 20 to wait for all executing parallel subprograms to com- 
plete. In this wait mode, no pipeline instructions (including vector operations using the instruction address 
stored In global program counter 34) are executed. For executing this instruction, the following operations are 
performed: address calculation, data fetch, address register load, table lookup, and FPU calculation are done 
as described above. The FPU 30 result is not stored. The global program counter 34 is incremented by one 

55 after each of the other five pipeline stages completes an instmction that will cause it to load its respective stage 
program counter 71 to 76 with the contents of the global program counter 34. For example, if the other pipeline 
stages are executing parallel subprograms, then the giobai program counter 34 is not again incremented until 
all the subprograms have completed. Any pipeline stages executing In the pipeline mode start waiting upon 
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completion of the current cycle of execution of the pipeline operation, i.e. complete execution of one instruction. 
When the global program counter 34 is not being Incremented, the pipeline wait instruction Is executed 
repeatedly by the processors that have completed their subprograms. When executions of all the subprograms 
have completed, the global program counter 34 is again incremented and pipeline mode execution resumes 
5 at the next pipeline instruction to be executed following the described wait operation. The respective stage pro- 
gram counter 71 to 76 is loaded with the instruction address contents of global program counter 34. As the FPU 
30 calculated result is not stored and the FPU condition code is not used, no useful function is perfomned by 
the FPU. 

10 Pipeline Signal Host 

This instruction Is used to set the host signal bit in the pipeline status register 90, a separate hardware 
register. The following operations are perfomr^ed: address calculation, data fetch, address register load, table 
lookup, and FPU calculation are done as described above. The FPU result is not stored. The host signal bit in 
16 the pipeline status register 90 is set. The global program counter 34 is incremented by one. The respective 
stage program counter 71 to 76 is loaded with the contents of the global program counter 34. As the FPU result 
is not stored and the FPU condition code is not used, no useful function is performed by the FPU 30. 

Signalling host 10 is initiated by this instruction through setting a "host signal pending" bit of status register 
90. Host 10 is programmed appropriately to time the sensing of status register 90 for the host signal pending 
20 bit or It can be achieved through known electrical circuits which respond to the host signal pending bit, such 
as in a known response to an interruption signal. 

The sequencing of operations of both program systems 14 and 20 (Fig. 3) is effected by control 52. In the 
global program system 20 (Fig. 4), a multiplexor 100 detenmines the input to the global program counter 34, 
, both as to branch input addresses and instruction address incrementation. The decoded INSTRUCTION 
25 OPCODE received over bus 99 from field 35 of the Instruction word to be executed, controls multiplexor 100 
to receive either the increment signal on line 1 01 , or the branch address received over cable 65 or neither of 
the above-such as when all stages In the pipeline are executing In the parallel mode. When the instruction 
opcode Is any of the pipeline instructions, then the line 101 increment signal is passed to the global program 
counter 34. 

30 Figs. 5 and 6 diagrammatically slows the serious pipeline stages in different modes of operation. Initially, 
all six pipeline stages PI to P6, also respectively represented by the program counters 71 to 76 in Fig. 4, are 
In the pipeline mode. The numbers 11 to 128 in the number column are merely reference numbers used to relate 
the timing diagram of Fig. 6 to Fig. 5. The first instruction at reference number 11 is a pipeline to parallel mode 
branch executed by pipeline stage P1 in the pipeline mode. The notation P1-1 indicates that stage processor 

35 PI is executing its instruction number 1. Fig. 6 shows this instruction execution as occurring in pipeline cycle 
number 1. The second instruction for processor PI is 119, executed in pipeline cyde 7. The third instruction for 
processor PI is 120 occum'ng in pipeline cycle 13 of Fig. 6 (during the third iteration of the pipeline controlling 
processor counter 15). Lines 119 and 120 represent P1 executing a parallel subprogram white later described 
other ones of the stage processors are executing in the pipeline mode or executing a vector subroutine. The 

40 fourtii instruction of processor PI is 16 in a pipeline to parallel branch for causing processor P1 and other stage 
processors P4, P5 and P6 to perfbnm one vector operation, each of the stage processors operating on different 
elements of the four element vector being processed. The branch Instruction 16 for processor PI occurs at 
pipeline cycle number 19 of Fig. 6. The fifth and sixth instructions 127 and 128 for processor PI occur in cycle 
numbers 25 and 31, showing the vector operation which also occurs In the three other stage processors P4, 

45 P5 and P6. The. seventh instruction 118 for processor PI is a pipeline mode instruction executed in pipeline 
cycle number 37. 

The sequence of operations in different modes for all the stage processors P2 to P6 can be followed in a 
similar manner. Also, the parallel and vector programs are shown as containing only two instructions only for 
the purposes of illustrating the operation of the present Invention. In a practi'cal program, the parallel mode open* 

50 ations and the vector operations could employ hundreds or more instructions in each excursion of the respective 
stage processor from the pipeline mode to the parallel mode. 

The table below also shows the timing relationships and concunrent operations of the stage processors PI 
to P6 in different modes. In this table tiie pipeline cycle number lists the lowest and highest numbered cycle in 
ttie respective pipeline iterations, i.e. number 01-06 indicates cycles 1-6 in the first pipeline Iteration. The 

55 abbreviation PI indicate pipeline mode execution, PB pipeline mode to parallel mode branch; PA parallel mode 
executton; PJ parallel mode to pipeline mode branch; VE means vector mode operation using a parallel type 
Instruction and the Index register 21 ; and VB vector mode (which repeatedly executes one pipeline instructions 
simultaneously in a variable number of pipeline stages at indexed data addresses) to pipeline mode as the index 
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register counts down to one. 

TABLE 1 

5 



CYCLE 


01-06 


07-12 


13-18 


19-24 


25-30 


31-36 


PI 


PB 


PA 


PJ 


PB/VB 


VE 


PJ/VE 


P2 


PB 


PA 


PA 


PA 


PJ 


PI 


P3 


PB 


PA 


PJ 


PI 


PI 


PI 


P4 


PI 


PB/VB 


VE 


PJ/VE 


PI 


PI 


P5 


PI 


PB/VB 


VE 


PJ/VE 


PI 


PI 


P6 


PB/VB 


VE 


PJ/VE 


PI 


PI 


PI 



The first pipeline iteration 01-06 finds all six stage processors in the pipeline mode, the second iteration 
07-12 finds PI to P3 in the parallel mode, P4 and P5 in the pipeline mode and PS in the vector mode. Exami- 
nation of Table 1 shows the dynamic changing of modes among the stage processors P1 to P6 which In effect 
20 provides load balancing for efficient total processing using any of the three described modes of operation. The 
following discussion describes some aspects in employing the present invention in performing some program 
operations. 

To load data into and store the contents of one of the address registers 40, any of the instructions described 
above may load an address register simply by specifying a load source and an address register 40 number 

25 (address of the address register) to load. There are various different ways a program could load address regi- 
sters 40. These include: load an address register from immediate data in the Instruction, load an address regi- 
ster from another register, perfonm arithmetic on the contents of an address register, or load an address register 
with data from the data store 12. 

Most of these operations use the source 2 address as the load source. Data can be loaded from another 

30 address register by specifying the source 2 addressing mode to be base plus displacement, specifying the other 
. register as the source 2 base register, and specifying the source 2 displacement to be zero. Similariy address 
register incrementing can be done by specifying a non-zero displacement and decrementing by specifying the 
displacement in two's complement notation. More complex address register arithmetic includes storing the 
address register contents to data store 12 and operating on it with the FPU and then loading the FPU result 

35 into the address register from data store 12. An immediate value can be loaded by specifying the value as the 
displacement and using an address register containing zero (0) as the base register. It is preferred that one of 
the global address registers 40 contain a zero (0). 

Storing the contents of an address register can be done by specifying: instruction opcode = any calculate 
type; source 1 addressing mode = base plus displacement; source 1 base register = register to store.source 

40 1 displacement ^ 0; source 1 1mmediate = 1 ; table type = None; FPU function = Pass source 1 . Both an address 
register load and an address register store can occur in the same instruction. This function can be useful for 
swapping address register data. 

Parallel mode programming bas one or more of the stage processors P1 to P6 executing an independent 
subprogram. Any of the parallel mode instructions may be used. Each subprogram should end with a "Parallel 

45 Calculate and Return" or a "Parallel calculate and Return Serialized" instruction. An optimization available is 
to combine loading an address register with a calculation in the same instruction. 

The pipeline apparatus 11 uses shared data store 12 for all concunrently running parallel subprograms. 
These parallel running subprograms can access shared read-only data with no constraints. Synchronization 
between the parallel subprogram execution is required to control updating shared writable data. In some nu- 

50 merical processing, the update to the shared data is of a serial kind. That is each subprogram needs to perform 
a simple operation on the data and the order these operations occur in is not significant As many numeric appli- 
cations need only this type of synchronization, the pipeline apparatus 1 1 provides a highly efficient means of 
doing it All the application needs to do is to use the serialized version of a calculate instruction and specify 
the location to be updated both as the destination and the source 1 address. Any FPU 30 operation can be 

55 used. In the usual case where no two subprograms are trying to update the same location at the same time 
there is no perfonmance penalty for using the serialized versus the non serialized calculate instruction. Thus 
most of the needed synchronization can be achieved with no programming overhead. For other applications, 
where ordering of global updates is important or when the update itself Is quite complex, usual multiprogram- 
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ming locking can be used, such as with a compare and swap oppration. A usual compare and swap operation 
compares the contents of a lock location against a specified compare value. If they are equal the lock is updated 
to a specified new lock value otherwise the new lock value is set to the existing contents of the lock location. 
A compare and swap operation can be implemented by the instmcttons (1) "Parallel Branch Serialized"; The 

5 address of the lock data Is specified as the source 1 address. The address of the compare data Is specified as 
the source 2 address. The FPU opcode is compare. branch address should be Instruction 4 and the Condition 
should branch If they are equal. (2) "Parallel Calculate Serialized"; This instruction is executed if the compare 
in instruction (1) was not equal and it should move the current lock data (specified as the source 1 address) to 
a new lock data location. (3) The "parallel branch" branches the subprogram around Instruction (4). (4) The 

10 parallel calculate serialized Instruction is executed If the compare in Instruction (1) above was equal. It should 
move the new lock data (specified as the source 2 address) to the lock data location (specified as both the 
source 1 and the destination addresses). A swap source 1 and source 2 bits (not shown) are set to move the 
data firom the source 2 address. This operation takes advantage of the fact that contiguous serialized instruc- 
tions with the same source 1 address prevent all other pipeline stages from obtaining serial access to the source 

15 1 Data while the instruction sequence is in progress. In this case, It prevents another pipeline stage from doing 
a compare once the instant compare has occurred and before the instructions have updated the lock data. 

Writing a program in pipeline mode applies all the pipeline stages to a single instruction stream. Any of the 
pipeline mode instructions may be used. If none of the processors is executing parallel subprograms then the 
same processor will execute every sixth instruction. There Is a classic problem with pipelines when an instruc- 

20 tion requires the result of a previous instruction that has not compi etely progressed all the way through pipel ine. 
Many machines have electronic circuits to detect this condition and to delay the execution of the instruction 
needing the data until it is available. When pipelined apparatus 1 1 does not detect this condition, it is then the 
responsibility of the program to ensure that the instructtons are scheduled (or spaced with no operation instruc- 
tions) so the data they use will be available. 

25 In branching, the "pipeline branch global" rather than the "pipeline branch" instruction is used for branches 
in a pipeline mode program. The pipeline branch instruction only branches the program counter of one stage 
processor whereas the pipeline branch global instruction branches the global program counter 34 which causes 
all of the stage processors P1 to P6 to follow the branch. As instructions from the pipeline mode program are 
started every pipeline cycle, the branch decision should occur in the first pipeline stage processor P1 so that 

30 the global program counter 34 can be updated In time for the next stage processor to execute the correct instruc- 
tion. However the FPU condition code is not available until the sbcth pipeline stage, so unlike the parallel branch 
instmctlon, the pipeline branch global instruction is conditional on the global flag bit in status register 90 rather 
than the FPU 30 generated results. To program the equivalent of the compare and branch capability of the par- 
allel branch Instruction the following pipeline mode instructions can be used: (1) pipeline set global flag: Do 

35 the compare and set the global flag if branch should be taken. (2) Five successive unrelated or "no-op" (no 
operation) Instructions are required to adhere to the pipeline constraints described above. (3) pipeline branch 
global: Branch if global flag bit of status register 90 is set to an active condition. An unconditional branch can 
be effected as with the branch global instruction. 

To Invoke parallel subprograms from the pipeline mode, a pipelined programmed application may use many 

40 parallel subprograms (or instances of the same program with different data) that can be concurrently executed. 
As each subprogram can use data dependent branches, it is not known until program execution how long each 
subprogram will take to execute. The pipelined apparatus 1 1 needs dynamically to assign each of the parallel 
mode subprograms to one of the stage processors so that the work will be evenly distributed amongst the stage 
processors. This dynamic assignment is effected through global program counter 34. The executing pipeline 

45 program uses a sequence of pipeline branch instructions. Each of these pipeline branch Instructions Initiates 
a different parallel subprogram to execute. The branch address specifies the parallel subprogram to be exec- 
uted (several pipeline branch instructions may specify the same parallel subprogram). The pipeline branch 
Instmctlon may also load an address register 40 with a parameter unique to that activation of the parallel sub- 
program. When the first pipeline branch instruction is executed, the first of stage processors P1 to P6 which is 

so In the pipeline mode branches to and begins to execute the first subprogram. The global program counter is 
incremented so that the next of stage processors P1 to P6 in the pipeline mode encounters the next ensuing 
pipeline branch instruction. This last-mentioned next stage processor then begins execution of the second sub- 
program in the next ensuing pipeline iteration. This execution of plural parallel subprogram continues until all 
six stage processors P1 to P6 are concunrently executing the first six parallel subprograms. At this point the 

55 global program counter 34 contains an Instruction address pointing to the seventh branch instruction. Global 
program counter 34 holds this seventh branch instruction address until one of the stage processors P1 to P6 
completes its respecthre subprogram and retums its mode of operation to the pipeline mode (encounters a par- 
allel return instruction, such as a calculate and retum instruction). When all pipeline stages are executing in 
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the parallel mode, the global program counter 34 is not incremented. With one pipeline stage now being in the 
pipeline mode, global program counter 34 supplies the Instruction address for the seventh branch for activating 
the one stage processor to execute the seventh parallel subprogram. The global program counter 34 is again 
Incremented to contain the instruction address of the next ensuing pipeline program instruction. In this manner 

5 each subprogram Is assigned to the first one of the stage processors P1 to P6 to become free (i.e. return to 
the pipeline mode indicating that the parallel subprogram it was executing has been finished) until the entire 
list of parallel subprograms is exhausted. At this point, as each stage processor completes Its final subprogram 
it retums to the pipeline program following the branch list for activating a plurality of parallel subprograms. 
If all the parallel subprograms must complete before the application can proceed, then the branch Instruo- 

10 tion sequence can be followed by a Pipeline Walt instruction. In this case some pipeline stages may be Idle 
while the remaining pipeline stages are completing their respective subprograms. To reduce the effect of this 
idling pipeline stages, the longer running parallel subprograms are positioned at the top of the parallel subprog- 
ram pipeline-branch list. The longer the fist of pipeline branches to the parallel mode, the more efficient the 
multiple parallel subprogram operation becomes. In perfomning vector operations, using the global index regi- 

15 ster 21 efficiently distributes the vector operations on each vector element (which are assumed to be indepen- 
dent) among the six pipeline stages P1 to P6. This kind of vector operation control is effected from within a 
pipeline mode program. If an application requires a vector operation within a parallel subprogram, it should be 
done by coding a usual program loop using a parallel branch Instruction. 

There are two types of vector operations. The first type Is a subroutine vector operation where more than 

20 one instruction is required to operate on each element of a vector. Examples of this type are a vector Sine oper- 
ation or calculations checking the convergence of a vector. The second type is an inline vector operation where 
only one instruction is used to operate on each element of a vector. An example of this type is adding two vec- 
tors. 

Subroutine vector operations are done in a manner similar to the parallel subprogram pipeline branch list, 

25 except that a single pipeline branch instruction is executed repeatedly by unsing the global index register 21 
to Inhibit incrementing of the global program counter 34. The repeated executions of the Instruction identified 
by the instruction address held in global program counter 34 can effect repeated incrementing of the source 
1, source 2 and destination addresses to effect the vector calculation. Each execution of the pipeline branch 
instruction causes a parallel subprogram to be run that computes the vector operation for a single element of 

30 the sector. The global index register 21 is also used to provide the index identifying which element of the vector 
is to computed by each Instance of the subprogram execution. Coding a subroutine vector operation requires 
two pipeline mode instructions plus the parallel subprogram or subroutine to operate on each element of the 
vector. The first pipeline mode instruction is a pipeline calculate and load global index register 21 instruction. 
The Index register 21 Is loaded with the vector length. This Instruction may also be used to do some other cal- 

35 culations. The second pipeline mode instruction is a pipeline branch instruction. The branch address is set to 
the address of the parallel subprogram or subroutine which operates on the vector elements. This second 
pipeline Instruction should also be setup to load one of the address registers 40 with the contents of the global 
index register 21. The activated parallel subprogram or subroutine is programmed as a usual parallel subprog- 
ram using the address register 40 loaded by the pipeline branch instruction to index into the vector. The vector 

40 base address would be typically in address register 40 which is loaded before the start of the vector operation. 
The parallel subprogram cause any parallel mode instructions including conditional branches such that the time 
required to compute each vector element may not be the same. This is no problem as the next vector element 
to be calculated is assigned to the next stage processor which Is in the pipeline mode regardless of the order 
in which the stage processore PI to P6 become available by returning to the pipeline mode after completing 

45 their respective parallel mode subprograms. If it is critical to the application that all elements of the vector oper- 
atbn be computed before the next ensuing pipeline instruction following the vector calculations is executed, 
then the vector operation should be followed by a pipeline wait instmctbn. 

In Inline vector operations, tiie global program counter 34 is inhibited from Incrementing by the global index 
register 21. Rather than code a branch instruction to be executed repeatedly, a pipeline calculate instruction 

5a performs the sector operation for a single element. As In the subroutine vector operation this Pipeline Calculate 
instmction also loads an address register 40 with the cunrent contents of tiie global index register 21 so that it 
can be used to index the vector data. As this address register 40 toad does not take place immediately, the 
vector calculation is treated as a pipeline. That is, a particular Instance of the pipeline calculate instruction is 
loading the address register 40 for the instance that will be executed two vector elements later. Hence the 

55 address register40 loading pipeline calculate instruction and the "using" pipeline calculate instruction are exec- 
uted by two different pipeline stages PI to P6 so an address register 40 suitable for global or pipelined oper- 
atk)ns stores tiie index value. When the vector-calculating pipeline operation stops, last two index values (2 
and 1) will have been loaded but not calculated on whereas when starting tiie firet two calculate instructions 
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the index value is not loaded. To solve this problem, simply ioad those two values into the address register 40 
In the two instructions before the calculate. Thus to program an inline vector operation, the following three 
pipeline instructions may be used: (1) pipeline calculate (load an address register 40; (2) pipeline calculate and 
load global Index register with vector length (load an address register 40); (3) pipeline calculate vector element 
5 operation (load the address register 40 with the contents of the global index register 21). This code requires a 
vector length of at least two. If necessary, an application can test for a vector length of one and specially handle 
It. 

Subroutine calls other than pipeline to parallel are facilitated by the ability to toad an address register 40 
with the address contents of a stage program counter 71 to 76 plus one. The subroutine uses this retum address 

10 by specifying a branch (or a branch global for pipeline mode) at the end of the subroutine using base plus dis- 
placement addressing mode for the destination address. 

Another use of the global flag brt of status register 90 besides the pipeline mode compare and branch oper- 
ations is In computing a condition that Is the composite of a series of concurrent calculations. An example of 
this situation is checking Newton convergence of a vector solution. Each element of the vector is checlced for 

IS convergence and If any of the elements does not converge then the Newton algorithm has not converged. To 
program this checicing, the check of a vector element is written as a parallel subprogram which could be invoked 
using the above-described vector or subprogram branch listings. The subprogram indicates that an element 
has not converged by setting the global flag bit of status register 90 (using a parallel set global flag instruction). 
If the global flag bit of status register 90 is set after all the subprograms checking sector convergence have 

20 completed, then the Newton algorithm has not converged. A pipeline branch global Instruction branches from 
a pipeline program to handle non-convergence of the calculations. Note that the branch global instruction resets 
the global flag bit of status register 90 so that a new test can be done after each pipeline branch test, i.e. like 
test and reset operations. 

As in the illustrative embodiment, pipeline mode instructions are required to Invoke parallel subprograms, 
25 every program will include at least one pipeline mode top level routine. Normally, to start a program, the host 
1 0 down loads the pipeline apparatus 1 1 instructions, sets the global program counter to point to the first instruc- 
tion, and starts the pipeline apparatus 11. The application does not require any special code at the beginning 
of the initial pipeline program. At the end of the pipeline program, the application will probably want to indicate 
to the host 1 0 that the pipeline program has completed, such as by the Pipeline IHost Signal instruction or inter- 
30 ruption circuits (not shown) may be employed. 

While the invention has heen particularly shown and described with reference to a preferred embodiment 
thereof, it will be understood by those skilled in the art that various changes in form and details may be made 
therein without departing from the scope of the invention. 

35 

Claims 

1. Apparatus for processing Instruction streams In a pipelined processor having a predetermined number of 
pipeline processing stages, including, in combination, first means for controlling the pipeline stages to pro- 

40 cess the predetenmined number of instruction streams simultaneously through the pipeline stages, second 

means in the first means for controlling the pipeline stages to process only a single Instruction stream 
through all the pipeline stages; and program means in each of the instruction streams and coupled to the 
pipeline stages and to the first and second means for program actuating the means for changing the oper- 
ation of the first and second means and the pipeline stages between the single stream and simultaneous 

45 instruction streams. 

2. Apparatus according to claim 1 , in which the first means Includes a stage program counter operatively con- 
nected to each pipeline stages, and the second means includes a global program counter connected to 
a multiplexor which In turn is connected to all the stage program counters for supplying instruction addres- 

50 ses to the stage program counters for executing a stream of instructions. 

3. Apparatus according to claim 2, In which there is a global index register for storing a vector indicatton con- 
nected to the global program counter and to the pipeline stages, the global program counter being res- 
ponsive to the global index register to supply the same instruction address to a predetennined number of 

55 the pipeline stages and the predetermined pipeline stages being responsive to the global index register 

to execute an instruction stream for calculating a plurality ofvaluesfora vector indicated in the global index 
register. 
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4. Apparatus according to claim 3, in wiiich tlie first means lias means for indicating that predetermined ones 
of the stage program counters are to generate instruction addresses, and the second means is connected 
to the indicating means and is responsive to the indicating means to supply instruction address only to 
those stage program counters not being indicated as generating instructions, such that the predetenmined 

5 one of the stage program counters generate instruction addresses in respective instmction streams and 

all others of the stage program counters receive instruction addresses from the global program counter 
such that some of the pipeline stages are operating in a pipeline mode sharing one instruction stream and 
the predetermined ones of the pipeline stages are operating in a parallel mode on their respective instruc- 
tion streams. 

10 

5. Apparatus according to any preceding claim, in which the program means in the single instruction stream 
Includes a first type of branch instruction for actuating the first and second means to effect the parallel 
mode in any one of the pipeline stages. 

15 6. Apparatus according to claim 5 in which the program means In each of the predetemnined number of 
instruction streams includes a second type of branch instruction which actuates the first and second means 
to change the parallel mode of operation for the pipeline stage executing the instruction to the pipeline 
mode. / 

20 7. Apparatus according to claim 2, or any daim appendant to daim 2, induding a processor counter con- 
nected to the pipeline stages and to the first means, for sequencing the stages in a predetenmined sequ- 
ence and directing the first means to supply the instructions addresses to the other pipeline stages on a 
first available basis irrespective of the instruction in the single stream of instructions which is to be next 
executed. 

25 

8. Apparatus according to daim 7, in which the processor counter has a count modulus equal to the pre- 
detenmined number and continuously cycles through the count modulus in a fixed sequence of counts, 
and the pipeline stages are interconnected from a first pipeline stage to a last pipeline stage and are 
addressable by the processor counter in the fixed sequence. 

30 

9. Apparatus according to claim 6, 7 or 8, in which the second type of branch instruction in any of the pre- 
detenmined number of instruction streams is a last instruction in such streams of instmctions. 

10. Apparatus according to claim 2, or any claim appendant to daim 2, in which the single stream of instruc- 
ts tions has a wait instruction which causes the first means to wait for completion of ali the then cunrent parallel 

mode stages to complete their respective ones of the predetenmined instruction streams and then supplies 
a next instruction address to one of the stage program counters. 

11. Apparatus according to any preceding daim, in which the single instruction stream has a first set of Instruc- 
40 tions and the predetemnined number of instruction streams has a second set of instructions, and all of the 

instructions, Induding branch Instructions having the same fonnat. 

^ 12. Apparatus according to any preceding daim, including a global flag In the first means, settable by any one 
of the processor stages when executing in the parallel mode, the first means being responsive to the global 
45 flag being set. to perform predetenfnined operations in the single instruction stream. 

13. Apparatus for operating a pipelined data processing system having a predetemnlned number of pipeline 
stages, induding, a stage program counter in each pipeline stage and each stage program counter count- 
ing instruction addresses independently of any other stage program counter, a global program counter 

so connected to all the stage program counters, and control means connected to all the program counters 

for selectively enabling predetenmined ones of the stage program counters to operate independently of 
the global program counter and enabling other than predetemnined ones of the stage program counters 
to receive Instruction addresses supplied by the global program counter for executing instructions in a 
sequence determined by the global program counter. 

55 

14. Apparatus according to claim 13, induding global program counter advancing means in the control means, 
operative to Increment the global program counter only for instruction addresses sent to the other stage 
program counters. 
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15. Apparatus according to daim 1 3 or 14, including vector means connected to the pipeline stages, the vector 
means enabling certain ones of the stage program counters to increment in a sequence for enabling vector 
calculations, and the control means being connected to the vector means for controlling the processor 
stages to perform vector calculations such that at any one time some of the stage program counters are 

5 receiving instruction addresses from the global program counter, some stage program counters are enabl- 

ing the vector calculations and others of the stage program counters are independently counting instruction 
addresses. 

16. Apparatus for pipelined data processing, including a plurality of pipeline stages each having a separate 
10 stage program counter, the stages being for independently executing a stream of instructions including 

incrementing the stage program counters, and global control means having a global program counter for 
supplying instruction addresses from the global program counter to predetermined ones of the stage pro- 
gram counters at predetermined times, and means for incrementing the global program counter each time 
an Instruction address is supplied to a one of the stage program counters. 

IS 

17. Apparatus according to claim 16, including means in each stage for Indicating that its stage program 
counter is being incremented for independent program execution in a parallel mode, and means in the 
global control means connected to all the indicating means and responsive to the indicated parallel mode 
to inhibit the global control means from supplying an instruction address to each stage program counter 

20 being Incremented In the parallel mode. 

18. Apparatus according to claim 17, Including means for erasing the parallel mode indications in the respec- 
tive pipeline stages. 

25 19. Apparatus according to daim 1 8, in which the erasing means is execution of an instruction by the respec- 
tive pipeline stage which branches operation from the parallel mode to the instruction counting of the global 
control means. 

20. Apparatus according to daim 17, 18 or 19, in which the global control means is first operable, such that 
30 all the pipeline stages are to receive instruction addresses from the global program counter in a pipeline 

mode, instructions identified by the instruction addresses supplied by the global pnDgram counter to the 
respective stage program counters include branch-to-parallel mode instructions to be executed by the res- 
pective pipeline stages, and the respective pipeline stages set the parallel mode indicator upon execution 
of each the branch-to-parallel mode instructions, such that some of the pipeline stages are in the pipeline 
35 mode while others are in a parallel mode. 

21. A method of operating a pipelined data processing apparatus having a plurality of pipeline stages, induding 
the machine-executed steps of executing a first stream of instructions in first predetermined ones of the 
pipeline stages, and concurrently executing a separate stream of instructions In second predetenmined 

40 ones of the pipeline stages. 

22. A method according to claim 21, induding the steps of storing a vector indicator, and concurrently with 
the first or any separate steams of instructions, executing a vector program in third predetemriined ones 
of the pipeline stages. 

45 

23. A method according to claim 21 or 22, in which the same format of instruction words is selected for all the 
streams of instructions. 

24. A method according to claim 22, or claim 23 as appendant to claim 22, Including dynamically switching 
50 the pipeline stages between the first, second and third predetemiined ones, such that at any given instant 

all the pipeline stages can be all the first, second or third predetermined ones or can ail be a predetenmined 
mbc of two of the first, second or third predetenmined ones. 

25. A method according to any of daims 21 to 24, including starting operation of the pipeline apparatus, such 
55 that all of the processor stages are the first predetermined ones. 

26. A method according to daim 22, or any claim appendant to claim 22, induding, while operating, holding 
operation of the pipeline stages that are first predetermined ones and completing operations of all pipeline 
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stages that are either second or third predetermined ones and maldng all of the pipeline stages first pre- 
detenmined ones. , 

27. A method of operating a pipeline data processing apparatus including the machine-executed steps of 
6 fetching an instruction to be executed in the pipeline apparatus, examining the instruction to ascertain 

whether it Is an Instruction to be executed in a pipeline eras a parallel program, and executing each instruc- 
tion in the mode Indicated such that in the pipeline apparatus instructions are being concurrently executed 
in either or both of the modes. 

10 28. A method according to claim 27, including the steps of generating a vector index, and executing a next 
instruction in the pipeline mode a number of times indicated by the generated index and indexing the data 
addresses of the instruction with each execution thereof, whereby vector calculations are performed in 
the pipeline mode. 
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